Abstract

We demonstrate that a dependency parser can be built using a credit assignment compiler, which removes the burden of worrying about low-level machine learning details from the parser implementation. The result is a simple parser that applies robustly to many languages and provides statistical and computational performance similar to the best transition-based parsing approaches to date, while avoiding various downsides including randomization, extra feature requirements, and custom learning algorithms.

1 Introduction 

Transition-based dependency parsers have a long history, in which many aspects of their construction have been studied: transition systems (Nivre, 2003; Nivre, 2004), feature engineering (Koo et al., 2008), neural-network predictors (Chen and Manning, 2014) and the importance of training against a “dynamic oracle” (Kuhlmann et al., 2011; Goldberg and Nivre, 2013). In this paper we focus on an understudied aspect of building dependency parsers: the role of getting the underlying machine learning technology “right”. In contrast to previous approaches which use heuristic learning strategies, we demonstrate that we can easily build a highly robust dependency parser with a “compiler” that automatically translates a simple specification of dependency parsing and labeled data into machine learning updates.

An issue with complex prediction problems is credit assignment: when something goes wrong, do you blame the first, second, or third prediction? Existing systems commonly take one of two strategies:

1. The system may ignore the possibility that a previous prediction may have been wrong, ignore that different errors may have different costs (consequences), or ignore that train-time prediction may differ from test-time prediction. These and other issues lead to statistical inconsistency: when features are not rich enough for perfect prediction, the machine learning may converge suboptimally.

2. The system may use hand-crafted credit-assignment heuristics to cope with errors the underlying algorithm makes and the long-term outcomes of decisions.

Here, we show instead that a learning to search compiler (Daume III et al., 2014) can automatically handle credit assignment using known techniques (Daume III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015) when applied to dependency parsing. Dependency parsing is more complex than previous applications of the compiler, and the approach may also be of interest for other similarly complex NLP problems, as it frees designers to worry about concerns other than low-level machine learning.

The advantage here is the combination of correctness and simplicity via removal of concerns:

1. The system automatically employs a cost-sensitive learning algorithm instead of a multiclass learning algorithm, ensuring the model learns to avoid compounding errors.

2. The system automatically “rolls in” with the learned policy and “rolls out” the dynamic oracle, ensuring competition with the oracle.

3. Advanced machine learning techniques or optimization strategies, such as neural networks or “fancy” online learning, are enabled with command-line flags with no additional implementation overhead.

4. The implementation is future-friendly: future compilers may yield a better parser.

5. Train/test asynchrony bugs are removed. Essentially, you only write the test-time “decoder” and the oracle.

6. The implementation is simple: this one is about 300 lines of C++ code.



Algorithm 1 RUNTAGGER(words)

  output ← []
  for n = 1 to LEN(words) do
    ref ← words[n].true_label
    output[n] ← PREDICT(words[n], ref, output[:n-1])
  end for
  LOSS(# output[n] ≠ words[n].true_label)
  return output


Experiments on the standard English Penn Treebank and nine other languages from CoNLL-X show that the compiled parser is competitive with recently published results (e.g., an average labeled accuracy of 81.7 over 10 languages, versus 80.3 for Goldberg and Nivre (2013)).

Altogether, this system provides a strong, simple baseline for future research on dependency parsing, and demonstrates that the compiler approach to solving complex prediction problems may be of broader interest.

2 Learning to Search 

Learning to search is a family of approaches for solving structured prediction tasks. This family includes a number of specific algorithms including the incremental structured perceptron (Collins and Roark, 2004; Huang et al., 2012), Searn (Daume III et al., 2009), DAGGER (Ross et al., 2011), AggreVaTe (Ross and Bagnell, 2014), and others (Daume III and Marcu, 2005; Xu and Fern, 2007; Xu et al., 2007; Ratliff et al., 2007; Syed and Schapire, 2011; Doppa et al., 2012; Doppa et al., 2014). Learning to search approaches solve structured prediction problems by (1) decomposing the production of the structured output in terms of an explicit search space (states, actions, etc.); and (2) learning hypotheses that control a policy that takes actions in this search space.

In this work we build on recent theoretical and implementational advances in learning to search that make development of novel structured prediction frameworks easy and efficient using “imperative learning to search” (Daume III et al., 2014). In this framework, an application developer needs to write (a) a “decoder” for the target structured prediction task (e.g., dependency parsing), (b) an annotation in the decoder that computes losses on the training data, and (c) a reference policy on the training data that returns at any prediction point a “suggestion” as to a good action to take at that state.¹

¹ Some papers in the past make an implicit or explicit assumption that this reference policy is an oracle policy: for every state, it always chooses the best action (assuming it gets to make all future decisions as well).

Figure 1: A search space implicitly defined by an imperative program. The system begins at the start state S and chooses the middle among three actions by the rollin policy twice. At state R it considers both the chosen action (middle) and both one-step deviations from that action (top and bottom). Each of these deviations is completed using the rollout policy until an end state is reached, at which point the loss is collected. Here, we learn that deviating to the top action (instead of middle) at state R decreases the loss by 0.2.

Algorithm 1 shows the code one must write for a part-of-speech tagger (or generic sequence labeler) under Hamming loss. The only annotations in this code, aside from the calls to the library function PREDICT, are the computation of a reference (an oracle reference is trivial under Hamming loss) and the computation of the total sequence loss at the end of the function. Note that in this example, the prediction of the tag for the nth word depends explicitly on the predictions of all previous words!

The machine learning question that arises is how to learn a good PREDICT function given just this information. The “imperative learning to search” answer (Daume III et al., 2014) is essentially to run the RUNTAGGER function many times, “trying out” different versions of PREDICT in order to learn one that yields low LOSS. The challenge is how to do this efficiently. The general strategy is, for some number of epochs, and for each example (x, y) in the training data, to do the following:

1. Execute RUNTAGGER on x with some rollin policy to obtain a search trajectory (a sequence of actions $a$) and loss $\ell_0$

2. Many times:

   (a) Choose some time step $t \le |a|$

   (b) Choose an alternative action $a'_t \ne a_t$

   (c) Execute RUNTAGGER on x, with PREDICT returning $a_{1:t-1}$ initially, then $a'_t$, then acting according to a rollout policy, to obtain a new loss $\ell_{t,a'_t}$

   (d) Compare the overall losses $\ell_0$ and $\ell_{t,a'_t}$ to construct a classification/regression example that demonstrates how much better or worse $a'_t$ is than $a_t$ in this context

3. Update the learned policy
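The steps above can be sketched for the tagger of Algorithm 1 as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the names `Policy`, `run_tagger`, `deviation_costs`, `reference_policy` and `always_zero` are hypothetical, labels are plain integers, positions are 0-indexed, and the sketch only collects the loss of every one-step deviation at a single time step rather than updating a learner.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A policy maps (position, reference label, outputs so far) to a label.
using Policy = std::function<int(std::size_t, int, const std::vector<int>&)>;

// Algorithm 1 with one addition: the first forced.size() predictions are
// replayed from `forced`; the remaining ones come from `policy`.
// Returns the Hamming loss and optionally stores the action sequence.
int run_tagger(const std::vector<int>& gold, const std::vector<int>& forced,
               const Policy& policy, std::vector<int>* trajectory = nullptr) {
  std::vector<int> output;
  for (std::size_t n = 0; n < gold.size(); ++n) {
    const int ref = gold[n];  // oracle reference under Hamming loss
    output.push_back(n < forced.size() ? forced[n] : policy(n, ref, output));
  }
  int loss = 0;
  for (std::size_t n = 0; n < gold.size(); ++n) loss += (output[n] != gold[n]);
  if (trajectory != nullptr) *trajectory = output;
  return loss;
}

// Steps 1 and 2: roll in once, then for every action at time step t replay
// the rollin prefix, take that action, and complete with the rollout policy.
std::vector<int> deviation_costs(const std::vector<int>& gold,
                                 const Policy& rollin, const Policy& rollout,
                                 std::size_t t, int num_labels) {
  assert(t < gold.size());
  std::vector<int> trajectory;
  run_tagger(gold, {}, rollin, &trajectory);
  std::vector<int> costs;
  for (int a = 0; a < num_labels; ++a) {
    std::vector<int> forced(trajectory.begin(),
                            trajectory.begin() + static_cast<std::ptrdiff_t>(t));
    forced.push_back(a);
    costs.push_back(run_tagger(gold, forced, rollout));
  }
  return costs;
}

// Reference policy: under Hamming loss the true label is always optimal.
int reference_policy(std::size_t, int ref, const std::vector<int>&) { return ref; }

// A deliberately poor stand-in for a learned policy: always predicts label 0.
int always_zero(std::size_t, int, const std::vector<int>&) { return 0; }
```

With gold labels {1, 2, 0}, rolling in with `always_zero` and rolling out with `reference_policy` at position 1 gives the cost vector {2, 2, 1}: the rollin mistake at position 0 is charged to every deviation, and only label 2 avoids a second error. A cost-sensitive learner would be trained on such vectors in step 3.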

Figure 1 shows a schematic of the search space implicitly defined by an imperative program. By executing this program three times (in this example), we are able to explore three different trajectories and compute their losses. These trajectories are defined by the rollin policy (which determines the initial trajectory), the position of one-step deviations (here, state R), and the rollout policy (which completes the trajectory after a deviation).

By varying the rollin policy, the rollout policy and the manner in which classification/regression examples are created, this general framework can mimic algorithms like Searn, DAGGER and AggreVaTe. For instance, DAGGER uses rollin=learned policy² and rollout=reference, while Searn uses rollin=rollout=stochastic mixture of learned and reference policies.

3 Dependency Parsing by Learning to 
Search 

Learning to search provides a natural framework for implementing a transition-based dependency parser. A transition-based dependency parser takes a sequence of actions and parses a sentence from left to right by maintaining a stack S, a buffer B, and a set of dependency arcs A. The stack maintains partial parses, the buffer stores the words to be parsed, and A keeps the arcs that have been generated so far. The configuration of the parser at each stage can be defined by a triple (S, B, A). For ease of notation, we use $w_p$ to represent the leftmost word in the buffer and use $s_1$ and $s_2$ to denote the top and the second top words in the stack. A dependency arc $(w_h, w_m)$ is a directed edge that indicates word $w_h$ is the parent of word $w_m$. When the parser terminates, the arcs in A form a projective dependency tree. We assume that each word has only one parent in the derived dependency parse tree, and use $A[w_m]$ to denote the parent of word $w_m$. For labeled dependency parsing, we further assign a tag to each arc representing the dependency type between the head and the modifier. For simplicity, we assume an unlabeled parser in the following description. The extension from an unlabeled parser to a labeled parser is straightforward, and is discussed at the end of this section.

² Technically, DAGGER rolls in with a mixture which is almost always instantiated to be “reference” for the first epoch and “learned” for subsequent epochs.

Algorithm 2 TRANS(S, B, A, action)

  Let w_p be the leftmost element in B
  if action = SHIFT then
    S.push(w_p)
    remove w_p from B
  else if action = REDUCE-LEFT then
    top ← S.pop()
    A ← A ∪ (w_p, top)
  else if action = REDUCE-RIGHT then
    top ← S.pop()
    A ← A ∪ (S.top(), top)
  end if
  return S, B, A

We consider an arc-hybrid transition system (Kuhlmann et al., 2011)³. In the initial configuration, the buffer B contains all the words in the sentence, a dummy root node is pushed onto the stack S, and the set of arcs A is empty. The root node cannot be popped at any time during parsing. The system then takes a sequence of actions until the buffer is empty and the stack contains only the root node (i.e., $|B| = 0$ and $S = \{\text{Root}\}$). When the process terminates, a parse tree is derived. At each state, the system can take one of the following actions:

1. SHIFT: push $w_p$ to S and move p to the next word. (Valid when $|B| > 0$.)

2. REDUCE-LEFT: add an arc $(w_p, s_1)$ to A and pop $s_1$. (Valid when $|B| > 0$ and $|S| > 1$.)

3. REDUCE-RIGHT: add an arc $(s_2, s_1)$ to A and pop $s_1$. (Valid when $|S| > 1$.)
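The three actions and their validity conditions can be sketched as below. This is an illustrative sketch rather than the parser's actual code: the names `Config`, `Action`, `is_valid`, `apply` and `parse_with` are hypothetical, words are represented by their positions 1..n with 0 standing for the dummy Root, and the buffer is kept implicitly as the range p..n.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

enum class Action { Shift, ReduceLeft, ReduceRight };

// Parser configuration (S, B, A).
struct Config {
  std::vector<int> stack{0};             // S; index 0 is the dummy Root
  int p = 1;                             // index of w_p, the leftmost buffer word
  int n = 0;                             // sentence length; the buffer is p..n
  std::set<std::pair<int, int>> arcs;    // A, stored as (parent, child) pairs
  bool buffer_empty() const { return p > n; }
  bool is_final() const { return buffer_empty() && stack.size() == 1; }
};

// Validity conditions listed with each action above.
bool is_valid(const Config& c, Action a) {
  switch (a) {
    case Action::Shift:       return !c.buffer_empty();
    case Action::ReduceLeft:  return !c.buffer_empty() && c.stack.size() > 1;
    case Action::ReduceRight: return c.stack.size() > 1;
  }
  return false;
}

// Apply one arc-hybrid transition in place.
void apply(Config& c, Action a) {
  assert(is_valid(c, a));
  if (a == Action::Shift) {
    c.stack.push_back(c.p);
    ++c.p;
    return;
  }
  const int top = c.stack.back();
  c.stack.pop_back();
  // REDUCE-LEFT attaches s_1 to w_p; REDUCE-RIGHT attaches s_1 to s_2.
  const int parent = (a == Action::ReduceLeft) ? c.p : c.stack.back();
  c.arcs.insert({parent, top});
}

// Run a fixed action sequence on a sentence of n words.
Config parse_with(int n, const std::vector<Action>& actions) {
  Config c;
  c.n = n;
  for (Action a : actions) apply(c, a);
  return c;
}
```

Numbering the words of "Flying planes can be dangerous" from 1 to 5 and replaying the action sequence of Figure 2 (SHIFT, REDUCE-LEFT, SHIFT, REDUCE-LEFT, SHIFT, SHIFT, SHIFT, then three REDUCE-RIGHT) ends in a final configuration with the arcs (2,1), (3,2), (4,5), (3,4) and (0,3), matching the last row of the figure.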

³ The learning to search framework is also suitable for other transition-based dependency parsing systems, such as arc-eager (Nivre, 2003) or arc-standard (Nivre, 2004) transition systems.




Action        Stack                    Buffer                             Arcs
(start)       [Root]                   [Flying planes can be dangerous]   {}
SHIFT         [Root Flying]            [planes can be dangerous]          {}
REDUCE-LEFT   [Root]                   [planes can be dangerous]          {(planes, Flying)}
SHIFT         [Root planes]            [can be dangerous]                 {(planes, Flying)}
REDUCE-LEFT   [Root]                   [can be dangerous]                 {(planes, Flying), (can, planes)}
SHIFT         [Root can]               [be dangerous]                     {(planes, Flying), (can, planes)}
SHIFT         [Root can be]            [dangerous]                        {(planes, Flying), (can, planes)}
SHIFT         [Root can be dangerous]  []                                 {(planes, Flying), (can, planes)}
REDUCE-RIGHT  [Root can be]            []                                 {(planes, Flying), (can, planes), (be, dangerous)}
REDUCE-RIGHT  [Root can]               []                                 {(planes, Flying), (can, planes), (be, dangerous), (can, be)}
REDUCE-RIGHT  [Root]                   []                                 {(planes, Flying), (can, planes), (be, dangerous), (can, be), (Root, can)}

[Two parse trees over "Root Flying planes can be dangerous". Left: parse tree derived by the above parser. Right: gold parse tree.]

Figure 2: An illustrative example of an arc-hybrid transition parser. The above table shows the actions taken and the intermediate configurations (stack, buffer, arcs) generated by a parser. The parse tree derived by the parser is in the bottom left, and the gold parse tree is in the bottom right. The distance between these two trees is 2.


Algorithm 3 RUNPARSER(sentence)

  stack S ← {Root}
  buffer B ← [words in sentence]
  arcs A ← ∅
  while B ≠ ∅ or |S| > 1 do
    ValidActs ← GETVALIDACTIONS(S, B)
    features ← GETFEAT(S, B, A)
    ref ← GETGOLDACTION(S, B)
    action ← PREDICT(features, ref, ValidActs)
    S, B, A ← TRANS(S, B, A, action)
  end while
  LOSS(A[w] ≠ A*[w], ∀w ∈ sentence)
  return A


Algorithm 2 shows the execution of these actions during parsing, and Figure 2 demonstrates an example of transition-based dependency parsing.

We can define a search space for dependency parsing such that each state represents one configuration during parsing. The start state is associated with the initial configuration, and the end states are associated with the configurations in which $|B| = 0$ and $S = \{\text{Root}\}$. The loss of each end state is defined by the distance between the derived parse tree and the gold parse tree. The above transition actions define how to move from one search state to another. In the following, we describe our implementation details.

Implementation. As mentioned in Section 2, to implement a parser using the learning to search framework, we need to provide a decoder, a loss function and a reference policy. Thanks to recent work (Goldberg and Nivre, 2013), we know how to compute a “dynamic oracle” reference policy that is optimal. The loss can be measured by how many parents differ between the derived parse tree and the gold annotated parse tree. Algorithm 3 shows the pseudo-code of a decoder for an unlabeled dependency parser. We discuss each subcomponent below.

• GetValidActions returns the set of valid actions that can be taken in the current configuration.

• GetFeat extracts features based on the current configuration. The features depend on the top few words in the stack and the leftmost few words in the buffer, as well as their associated part-of-speech tags. We list our feature templates in Table 1. All features are generated dynamically because the configuration changes during parsing.

• GetGoldAction implements the dynamic oracle described in (Goldberg and Nivre, 2013). In any state, the dynamic oracle returns the optimal action, i.e., the one that leads to the reachable end state with the minimal loss.

• Predict is a library call implemented in the learning to search system. Given training samples, the learning to search system can learn the policy automatically. Therefore, in the test phase, this function returns the predicted action leading to an end state with small structured loss.

• Trans implements the arc-hybrid transition system. Based on the predicted action and labels, it updates the parser’s configuration and moves the agent to the next search state.

• Loss measures the distance between the predicted output and the gold annotation. Here, we simply use the number of words for which the parent is wrong as the loss. LOSS has no effect in the test phase.

The above decoder implements an unlabeled parser. To build a labeled parser, when the transition action is REDUCE-LEFT or REDUCE-RIGHT, we call the PREDICT function again to predict the dependency type of the arc. The loss in the labeled dependency parser can be measured by $\sum_i \mathrm{loss}(w_i)$, where

$$
\mathrm{loss}(w_i) =
\begin{cases}
2 & A[w_i] \neq A^*[w_i] \\
1 & A[w_i] = A^*[w_i],\ L[w_i] \neq L^*[w_i] \\
0 & \text{otherwise.}
\end{cases}
\qquad (1)
$$


Unigram Features 

Si, S2, S3, bi,b 2 , 63, ^i(si), ^2(51), 

i?l(gl), f?l(g2), -^2(^1), -^1(^2) 

Bigram Features

$s_1 s_1$, $s_2 s_2$, $s_3 s_3$, $b_1 b_1$, $b_2 b_2$, $b_3 b_3$, $s_1 b_1$, $s_1 s_2$, $b_1 b_2$

Trigram Features 

S1S2S3 , Sibib 2 , SiS2^1, Sl^l&3, 

bib 2 bz , siRi{si)Ri{s 2 ), siL 2 {si)L 2 {bi), 

61^1(61)^2(61), siS2-^-i(6i), si6iLi(si), 

si6iLi(g2), si6iLi(6i) _ 

Table 1: Features used in our dependency parsing system. $s_i$ represents the $i$-th top element in the stack S. $b_i$ is the $i$-th leftmost word in the buffer B. $L_i(w)$ and $R_i(w)$ are the $i$-th leftmost child and rightmost child of the word $w$. For each feature template, we include the surface string and the associated part-of-speech (POS) tag as features. For $R_i(w)$ and $L_i(w)$, we also include arc labels as features. A feature hashing technique (Weinberger et al., 2009) is employed to provide fast feature lookup.
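The idea behind feature hashing can be illustrated with a short sketch: a feature string is hashed and masked into a fixed-size weight table, so no dictionary from feature names to indices is needed. The function names `fnv1a` and `feature_index`, the feature string format, and the choice of the FNV-1a hash are assumptions made for this example; the library's own hash function may differ, and the signed variant of Weinberger et al. (2009) is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a hash, used here only as a simple, deterministic stand-in
// for whatever hash function the learning library uses (assumption).
std::uint32_t fnv1a(const std::string& s) {
  std::uint32_t h = 2166136261u;  // FNV offset basis
  for (unsigned char ch : s) {
    h ^= ch;
    h *= 16777619u;               // FNV prime
  }
  return h;
}

// Map a feature string to an index in a weight table with 2^bits entries.
// Distinct features may collide; the learner simply shares the weight.
std::uint32_t feature_index(const std::string& feature, unsigned bits) {
  assert(bits > 0 && bits < 32);
  const std::uint32_t mask = (1u << bits) - 1u;
  return fnv1a(feature) & mask;
}
```

A hypothetical bigram feature such as "s1=planes|b1=can" would be looked up with `feature_index("s1=planes|b1=can", 18)`, which always returns the same slot below 2^18 for the same string.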


Here, $A[w_i]$ and $A^*[w_i]$ are the parents of $w_i$ in the derived parse tree and the gold parse tree, respectively, $L[w_i]$ is the label assigned to the arc $(A[w_i], w_i)$, and $L^*[w_i]$ is the corresponding gold label. We observe that this simple loss function performs well empirically.
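Equation (1) translates directly into code. The sketch below is illustrative only: `word_loss` and `labeled_loss` are hypothetical names, heads and labels are represented as integer ids, and summing the per-word losses over the sentence is an assumption about how the sentence-level loss is aggregated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-word loss of Equation (1): 2 for a wrong parent, 1 for a correct
// parent with a wrong arc label, 0 otherwise.
int word_loss(int head, int gold_head, int label, int gold_label) {
  if (head != gold_head) return 2;
  if (label != gold_label) return 1;
  return 0;
}

// Sentence-level loss, assumed here to be the sum of per-word losses.
int labeled_loss(const std::vector<int>& heads, const std::vector<int>& gold_heads,
                 const std::vector<int>& labels, const std::vector<int>& gold_labels) {
  assert(heads.size() == gold_heads.size());
  assert(labels.size() == gold_labels.size());
  assert(heads.size() == labels.size());
  int total = 0;
  for (std::size_t i = 0; i < heads.size(); ++i)
    total += word_loss(heads[i], gold_heads[i], labels[i], gold_labels[i]);
  return total;
}
```

For a three-word sentence in which the first word is fully correct, the second has the right parent but the wrong label, and the third has the wrong parent, the total is 0 + 1 + 2 = 3.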

We implemented our parser based on an open-source library supporting learning to search. The implementation requires about 300 lines of C++ code. The reduction in implementation effort is twofold. First, in the learning to search framework, there is no need to implement a learning algorithm. Once the decoding function is defined, the system is able to learn the best PREDICT function from training data. Second, L2S provides a unified framework, which allows the library to provide common functions for ease of implementation. For example, quadratic and cubic feature generating functions and a feature hashing mechanism are provided by the library. The unified framework also allows a user to experiment with different base learners and hyper-parameters using command-line arguments without modifying the code.

Base Learner. As mentioned in Section 2, the learning to search framework reduces structured prediction to cost-sensitive multi-class classification, which can be further reduced to regression. This reduction framework allows us to employ well-studied binary and multi-class classification methods as the base learner. We analyze the value of using more powerful base learners in the experiment section.

Parser   Transition     Base learner   Reference
L2S      arc-hybrid     NN             Dynamic
Dyna     arc-hybrid     perceptron     Dynamic
Snn      arc-standard   NN             Static

Table 2: Parser settings.

4 Experimental Results 

While most work compares with MaltParser or MSTParser, which are indeed weak baselines, we compare with two recent strong baselines: the greedy transition-based parser with dynamic oracle (Goldberg and Nivre, 2013) and the Stanford neural network parser (Chen and Manning, 2014). We evaluate on a wide range of different languages, and show that our parser achieves comparable or better results on all languages, with significantly less engineering.



Parser   Ar       Bu      Ch      Da      Du      En      Ja      Po       Sl       Sw      Avg

UAS
L2S      77.59    90.64   90.46   88.03   78.06   92.30   90.89   89.77    81.28    89.12   86.81
Dyna     77.89    89.54   89.41   87.37   74.63   91.84   92.72   85.82    77.14    87.85   85.42
Snn      67.37*   88.05   87.31   82.98   75.34   90.20   89.45   83.19*   63.60*   85.70   81.32*

LAS
L2S      66.44    85.07   86.43   81.36   73.55   91.09   89.53   84.68    72.48    82.81   81.34
Dyna     66.33    84.73   85.14   82.30   70.26   90.81   90.91   82.00    68.65    82.21   80.33
Snn      51.72*   84.01   82.72   77.44   71.96   89.10   87.37   77.88*   51.08*   80.09   75.34*

Table 3: UAS and LAS on PTB and CoNLL-X. The average score over all languages is shown in the last column. Snn makes assumptions about the structure of languages and hence obtains substantially worse performance on languages with multi-root trees (marked with *). Excluding these languages, Snn achieves 85.6 (UAS) and 81.8 (LAS) on average, while L2S achieves 88.5 and 84.3.


4.1 Datasets 

We conduct experiments on the English Penn Treebank (PTB) (Marcus et al., 1993) and the CoNLL-X (Buchholz and Marsi, 2006) datasets for 9 other languages: Arabic, Bulgarian, Chinese, Danish, Dutch, Japanese, Portuguese, Slovene and Swedish. For PTB, we convert the constituency trees to dependencies using the head rules of Yamada and Matsumoto (2006). We follow the standard split: sections 2 to 21 for training, section 22 for development and section 23 for testing. The POS tags in the evaluation data are assigned by the Stanford POS tagger (Toutanova et al., 2003), which has an accuracy of 97.2% on the PTB test set. For CoNLL-X, we use the given train/test splits and reserve the last 10% of the training data for development if needed. The gold POS tags given in the CoNLL-X datasets are used.

4.2 Setup and Parameters 

For L2S, the rollin policy is a mixture of the current (learned) policy and the reference (dynamic oracle) policy. The probability of executing the reference policy decreases over each round. Specifically, we set it to be $1 - (1 - \alpha)^t$, where $t$ is the number of rounds and $\alpha$ is set to $10^{-5}$ in all experiments. It has been shown (Ross and Bagnell, 2014; Chang et al., 2015) that when the reference policy is optimal, it is preferable to roll out with the reference. Therefore, we roll out with the dynamic oracle (Goldberg and Nivre, 2013).

Our base learner is a simple neural network with one hidden layer. The hidden layer size is 5 and we do not use word or POS tag embeddings. We find the Follow-the-Regularized-Leader-Proximal (FTRL) online learning algorithm particularly effective for learning the neural network and simply use default hyperparameters.

We compare with the recent transition-based parser with dynamic oracles (Dyna) (Goldberg and Nivre, 2013), and the Stanford neural network parser (Snn) (Chen and Manning, 2014). Settings of the three parsers are shown in Table 2.

For Dyna, we use the software provided by the authors online⁴. Our initial experiments show that its performance is best using the arc-hybrid system with exploration parameters k = 1, p = 1; thus we use this setting for all experiments. The best model evaluated on the development set among 5 runs with different random seeds is chosen for testing.

For Snn, we use the latest Stanford parser.⁵ Since all other parsers do not use external resources, we do not provide pretrained word embeddings and initialize randomly. We use the same parameter values as suggested in (Chen and Manning, 2014), which are also the default settings of the software. The best model over 20000 iterations evaluated on the development set is used for testing.⁶

In addition, we compare with the RedShift⁷ parser on PTB. For fair comparison, we only use its basic features (excluding features based on the Brown cluster). We use the default parameters, which run a beam search with width 8. In our experiments, the RedShift parser has UAS 92.10 and LAS 90.83 on the PTB test set.

⁴ Available at https://bitbucket.org/yoavgo/tacl2013dynamicoracles
⁵ Available at http://nlp.stanford.edu/software/nndep.shtml
⁶ Enabled by -saveintermediate.
⁷ Available at https://github.com/syllogism/redshift

Base learner   Dev UAS   Dev LAS   Test UAS   Test LAS
SGD            89.34     88.03     89.34      87.89
SGD+           91.0      89.5      91.0       89.6
NN             92.02     90.78     91.97      90.84
NN+FTRL        92.27     91.04     92.30      91.09
Multiclass     91.7      90.6      91.3       90.2

Table 4: Performance of different base learning algorithms with the L2S parser on the PTB corpus.

4.3 Results

We report unlabeled attachment scores (UAS) and labeled attachment scores (LAS) in Table 3. Punctuation is excluded in all evaluations. Our parser achieves up to 4% improvement on both UAS and LAS. Compared with Dyna, our parser has the same transition system and oracle but more powerful base learners to choose from. Compared with Snn, we use far fewer hidden units and parameters to tune.

The Value of Strong Base Learners. L2S allows us to leverage well-studied classification methods. We show the performance when training with base learners using the following update rules. Unless stated otherwise, all the base learners are cost-sensitive multiclass classifiers.

1. SGD: stochastic gradient descent updates.

2. SGD+: improved update rule using an adaptive metric (Duchi et al., 2011; McMahan and Streeter, 2010), importance invariant updates (Karampatziakis and Langford, 2011), and normalized updates (Ross et al., 2013).

3. NN: a single-hidden-layer neural network with 5 hidden nodes.

4. NN+FTRL: a neural network learner with follow-the-regularized-leader regularization (the base learner in the above experiments).

5. Multiclass: a multiclass classifier using NN+FTRL update rules. The gold label is given by the dynamic oracle.

The results in Table 4 show that using a strong base learner and taking care of low-level learning details (i.e., using a cost-sensitive multiclass classifier) can improve the performance.

Features         Dev UAS   Dev LAS   Test UAS   Test LAS
Uni-gram         80.41     78.01     80.97      78.65
Uni- + Bi-gram   90.73     89.46     91.08      89.81
All features     92.27     91.04     92.30      91.09

Table 5: The contribution of Bi-gram and Tri-gram features. Results are evaluated on the dev and the test set of PTB.

Finally, Table 5 shows the performance of different feature templates. Using a comprehensive set of features leads to a better dependency parser.

Finally, Figure 5 shows the performance of dif¬ 
ferent feature templates. Using a comprehensive 
set of features leads to a better dependency parser. 

5 Related Work 

Training a transition-based dependency parser can be viewed as an imitation learning problem. However, most early works focus on decoding or feature engineering instead of the core learning algorithm. For a long time, the averaged perceptron was the default learner for dependency parsing. Goldberg and Nivre (2013) first proposed dynamic oracles under the framework of imitation learning. Their approach is essentially a special case of our algorithm: the base learner is a multi-class perceptron, and no rollout is executed to assign costs to actions. In this work, we combine dynamic oracles into learning and explore the search space in a more principled way by learning to search: by cost-sensitive classification, we evaluate the end result of each non-optimal action instead of treating them as equally bad.

There are a number of works that use the L2S approach to solve various other structured prediction problems, for example, sequence labeling (Doppa et al., 2014), coreference resolution (Ma et al., 2014), and graph-based dependency parsing (He et al., 2013). However, these works can be considered special settings under our unified learning framework, e.g., with a custom action set or different rollin/rollout methods.

To our knowledge, this is the first work that develops a general programming interface for dependency parsing, or more broadly, for structured prediction. Our system bears some resemblance to probabilistic programming languages (e.g., McCallum et al., 2009; Gordon et al., 2014); however, instead of relying on a new programming language, ours is implemented in C++ and Python, and is thus easily accessible.

6 Conclusion and Discussion 

We have described a simple transition-based dependency parser based on the learning to search framework. We show that it is now much easier to implement a high-performance dependency parser. Furthermore, we provide a wide range of advanced optimization methods to choose from during training. Experimental results show that we consistently achieve better performance across 10 languages. An interesting direction for future work is to extend the current system beyond greedy search. In addition, there is ample room for speeding up training by choosing more carefully where to roll out.
Function          Number of lines
Setup             90
GetValidActions   17
GetFeat           86
GetGoldAction     41
Trans             28
RunParser         40
Total             331

Table 6: Number of code lines of our dependency parser implementation. “Setup” contains the class constructor, destructor, and handlers for the learning to search framework.

Dependency Parser            Number of lines
L2S (ours)                   ~300
Stanford                     ~3K
RedShift                     ~2K
(Goldberg and Nivre, 2013)   ~4K
Malt Parser                  ~10K

Table 7: Number of lines of dependency parser implementations.

We implemented our dependency parser in Vowpal Wabbit (http://hunch.net/~vw/), a machine learning system supporting online learning, hashing, reductions, and L2S. Table 6 shows the number of code lines for each function in our implementation, and Table 7 shows the number of lines of other popular dependency parsing systems. RedShift and the parser of Goldberg and Nivre (2013) are implemented in Python. Stanford and Malt Parser are in Java. Our implementation is in C++. C++ code is usually longer than Python and comparable in length to Java. The code is readable and contains proper comments, as shown below.



#include "search_dep_parser.h"
#include "gd.h"
#include "cost_sensitive.h"

#define val_namespace 100 // valency and distance feature space
#define offset_const 344429

namespace DepParserTask { Search::search_task task = { "dep_parser", run, initialize, finish, setup, nullptr}; }

// Per-task state shared by all handlers.
struct task_data {
  example *ex;
  size_t root_label, num_label;
  v_array<uint32_t> valid_actions, valid_labels, action_loss, gold_heads, gold_tags, stack, heads, tags, temp;
  v_array<uint32_t> children[6]; // [0]: num_left_arcs, [1]: num_right_arcs, [2]: leftmost_arc, [3]: second_leftmost_arc, [4]: rightmost_arc, [5]: second_rightmost_arc
  example *ec_buf[13];
};

namespace DepParserTask {
  using namespace Search;

  void initialize(Search::search& srn, size_t& num_actions, po::variables_map& vm) {
    task_data *data = new task_data();
    data->action_loss.resize(4, true);
    data->ex = NULL;
    srn.set_num_learners(3);
    srn.set_task_data<task_data>(data);

    // Command-line options specific to the dependency parser.
    po::options_description dparser_opts("dependency parser options");
    dparser_opts.add_options()
      ("root_label", po::value<size_t>(&(data->root_label))->default_value(8), "Ensure that there is only one root in each sentence")
      ("num_label", po::value<size_t>(&(data->num_label))->default_value(12), "Number of arc labels");
    srn.add_program_options(vm, dparser_opts);

    // Every label except the root label is a valid arc label.
    for (size_t i=1; i<=data->num_label; i++)
      if (i != data->root_label)
        data->valid_labels.push_back(i);

    data->ex = alloc_examples(sizeof(polylabel), 1);
    data->ex->indices.push_back(val_namespace);
    for (size_t i=1; i<14; i++)
      data->ex->indices.push_back((unsigned char)i+'A');
    data->ex->indices.push_back(constant_namespace);

    vw& all = srn.get_vw_pointer_unsafe();
    // Namespace pairs and triples used for quadratic and cubic features.
    const char* pair[] = {"BC", "BE", "BB", "CC", "DD", "EE", "FF", "GG", "EF", "BH", "BJ", "EL", "dB", "dC", "dD", "dE", "dF", "dG", "dd"};
    const char* triple[] = {"EFG", "BEF", "BCE", "BCD", "BEL", "ELM", "BHI", "BCC", "BJE", "BHE", "BJK", "BEH", "BEN", "BEJ"};
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vector<string> newpairs(pair, pair+19); 
vector<string> newtriples(triple, triple+14); 
all.pairs.swap(newpairs); 
all.triples.swap(newtriples); 

srn.set_Options(AUTO_CONDITION_FEATURES | NO_CACHING); 

srn.set_label_parser( COST_SENSITIVE::cs_label, [](polylabel&l) -> bool { return 1.cs.costs.size() == 0; 

}) ; 


void finish(Search::search& srn) {
  task_data *data = srn.get_task_data<task_data>();
  data->valid_actions.delete_v();
  data->valid_labels.delete_v();
  data->gold_heads.delete_v();
  data->gold_tags.delete_v();
  data->stack.delete_v();
  data->heads.delete_v();
  data->tags.delete_v();
  data->temp.delete_v();
  data->action_loss.delete_v();
  dealloc_example(COST_SENSITIVE::cs_label.delete_label, *data->ex);
  free(data->ex);
  for (size_t i=0; i<6; i++) data->children[i].delete_v();
  delete data;
}


// append one feature with weight 1 to namespace ns of ex
void inline add_feature(example *ex, uint32_t idx, unsigned char ns, size_t mask, uint32_t multiplier){
  feature f = {1.0f, (idx * multiplier) & (uint32_t)mask};
  ex->atomics[(int)ns].push_back(f);
}

// clear all features of ex so that it can be reused for the next prediction
void inline reset_ex(example *ex) {
  ex->num_features = 0;
  ex->total_sum_feat_sq = 0;
  for(unsigned char *ns = ex->indices.begin; ns!=ex->indices.end; ns++){
    ex->sum_feat_sq[(int)*ns] = 0;
    ex->atomics[(int)*ns].erase();
  }
}



// arc-hybrid system: apply action a_id (with arc label t_id) and return the new buffer index
uint32_t transition_hybrid(Search::search& srn, uint32_t a_id, uint32_t idx, uint32_t t_id) {
  task_data *data = srn.get_task_data<task_data>();
  v_array<uint32_t> &heads=data->heads, &stack=data->stack, &gold_heads=data->gold_heads, &gold_tags=data->gold_tags, &tags=data->tags;
  v_array<uint32_t> *children = data->children;
  switch(a_id) {
    case 1: // SHIFT: move the front of the buffer onto the stack
      stack.push_back(idx);
      return idx+1;
    case 2: // RIGHT: attach the top of the stack to the item below it
      heads[stack.last()] = stack[stack.size()-2];
      children[5][stack[stack.size()-2]] = children[4][stack[stack.size()-2]];
      children[4][stack[stack.size()-2]] = stack.last();
      children[1][stack[stack.size()-2]]++;
      tags[stack.last()] = t_id;
      // loss 2 for a wrong head, 1 for a wrong label only, 0 otherwise
      srn.loss(gold_heads[stack.last()] != heads[stack.last()]?2:(gold_tags[stack.last()] != t_id)?1:0);
      stack.pop();
      return idx;
    case 3: // LEFT: attach the top of the stack to the front of the buffer
      heads[stack.last()] = idx;
      children[3][idx] = children[2][idx];
      children[2][idx] = stack.last();
      children[0][idx]++;
      tags[stack.last()] = t_id;
      srn.loss(gold_heads[stack.last()] != heads[stack.last()]?2:(gold_tags[stack.last()] != t_id)?1:0);
      stack.pop();
      return idx;
  }
  return idx;
}


void extract_features(Search::search& srn, uint32_t idx, vector<example*> &ec) {
  vw& all = srn.get_vw_pointer_unsafe();
  task_data *data = srn.get_task_data<task_data>();
  reset_ex(data->ex);
  size_t mask = srn.get_mask();
  uint32_t multiplier = all.wpp << all.reg.stride_shift;
  v_array<uint32_t> &stack = data->stack, &tags = data->tags, *children = data->children, &temp = data->temp;
  example **ec_buf = data->ec_buf;
  example &ex = *(data->ex);

  add_feature(&ex, (uint32_t) constant, constant_namespace, mask, multiplier);
  size_t n = ec.size();
  for(size_t i=0; i<13; i++)
    ec_buf[i] = nullptr;

  // features based on the top three examples in the stack; ec_buf[0]: s1, ec_buf[1]: s2, ec_buf[2]: s3
  for(size_t i=0; i<3; i++)
    ec_buf[i] = (stack.size()>i && *(stack.end-(i+1))!=0) ? ec[*(stack.end-(i+1))-1] : 0;

  // features based on examples in the string buffer; ec_buf[3]: b1, ec_buf[4]: b2, ec_buf[5]: b3
  for(size_t i=3; i<6; i++)
    ec_buf[i] = (idx+(i-3)-1 < n) ? ec[idx+i-3-1] : 0;

  // features based on the leftmost and the rightmost children of the top element of the stack;
  // ec_buf[6]: sl1, ec_buf[7]: sl2, ec_buf[8]: sr1, ec_buf[9]: sr2
  for(size_t i=6; i<10; i++) {
    if (!stack.empty() && stack.last() != 0 && children[i-4][stack.last()]!=0)
      ec_buf[i] = ec[children[i-4][stack.last()]-1];
  }

  // features based on the leftmost children of the top element in the buffer; ec_buf[10]: bl1, ec_buf[11]: bl2
  for(size_t i=10; i<12; i++)
    ec_buf[i] = (idx <=n && children[i-8][idx]!=0)? ec[children[i-8][idx]-1] : 0;
  ec_buf[12] = (stack.size()>1 && *(stack.end-2)!=0 && children[2][*(stack.end-2)]!=0)? ec[children[2][*(stack.end-2)]-1] : 0;

  // unigram features
  uint64_t v0;
  for(size_t i=0; i<13; i++) {
    for (unsigned char* fs = ec[0]->indices.begin; fs != ec[0]->indices.end; fs++) {
      if(*fs == constant_namespace) // ignore constant_namespace
        continue;
      uint32_t additional_offset = (uint32_t)(i*offset_const);
      if (!ec_buf[i]) {
        // no token at this position: emit placeholder features
        for(size_t k=0; k<ec[0]->atomics[*fs].size(); k++) {
          v0 = affix_constant*((*fs+1)*quadratic_constant + k);
          add_feature(&ex, (uint32_t) v0 + additional_offset, (unsigned char)((i+1)+'A'), mask, multiplier);
        }
      }
      else {
        // copy the token's features into the namespace for position i
        for(size_t k=0; k<ec_buf[i]->atomics[*fs].size(); k++) {
          v0 = (ec_buf[i]->atomics[*fs][k].weight_index / multiplier);
          add_feature(&ex, (uint32_t) v0 + additional_offset, (unsigned char)((i+1)+'A'), mask, multiplier);
        }
      }
    }
  }

  // Other features
  temp.resize(10, true);
  temp[0] = stack.empty()? 0: (idx >n? 1: 2+min(5, idx - stack.last()));
  temp[1] = stack.empty()? 1: 1+min(5, children[0][stack.last()]);
  temp[2] = stack.empty()? 1: 1+min(5, children[1][stack.last()]);
  temp[3] = idx>n? 1: 1+min(5, children[0][idx]);
  for(size_t i=4; i<8; i++)
    temp[i] = (!stack.empty() && children[i-2][stack.last()]!=0)? tags[children[i-2][stack.last()]] : 15;
  for(size_t i=8; i<10; i++)
    temp[i] = (idx <=n && children[i-6][idx]!=0)? tags[children[i-6][idx]] : 15;

  size_t additional_offset = val_namespace*offset_const;
  for(int j=0; j<10; j++) {
    additional_offset += j*1023;
    add_feature(&ex, temp[j] + additional_offset, val_namespace, mask, multiplier);
  }

  // count the unigram, pair, and triple features generated for this example
  size_t count=0;
  for (unsigned char* ns = data->ex->indices.begin; ns != data->ex->indices.end; ns++) {
    data->ex->sum_feat_sq[(int)*ns] = (float) data->ex->atomics[(int)*ns].size();
    count += data->ex->atomics[(int)*ns].size();
  }
  for (vector<string>::iterator i = all.pairs.begin(); i != all.pairs.end(); i++)
    count += data->ex->atomics[(int)(*i)[0]].size() * data->ex->atomics[(int)(*i)[1]].size();
  for (vector<string>::iterator i = all.triples.begin(); i != all.triples.end(); i++)
    count += data->ex->atomics[(int)(*i)[0]].size() * data->ex->atomics[(int)(*i)[1]].size() * data->ex->atomics[(int)(*i)[2]].size();
  data->ex->num_features = count;
  data->ex->total_sum_feat_sq = (float) count;
}

void get_valid_actions(v_array<uint32_t> & valid_action, uint32_t idx, uint32_t n, uint32_t stack_depth, uint32_t state) {
  valid_action.erase();
  if(idx<=n) // SHIFT
    valid_action.push_back(1);
  if(stack_depth >=2) // RIGHT
    valid_action.push_back(2);
  if(stack_depth >=1 && state!=0 && idx<=n) // LEFT
    valid_action.push_back(3);
}

bool is_valid(uint32_t action, v_array<uint32_t> valid_actions) {
  for(size_t i=0; i< valid_actions.size(); i++)
    if (valid_actions[i] == action)
      return true;
  return false;
}

size_t get_gold_actions(Search::search &srn, uint32_t idx, uint32_t n){
  task_data *data = srn.get_task_data<task_data>();
  v_array<uint32_t> &action_loss = data->action_loss, &stack = data->stack, &gold_heads=data->gold_heads, &valid_actions=data->valid_actions;

  if (is_valid(1,valid_actions) && (stack.empty() || gold_heads[idx] == stack.last()))
    return 1;

  if (is_valid(3,valid_actions) && gold_heads[stack.last()] == idx)
    return 3;

  // invalid actions get a large cost; valid ones start at zero
  for(size_t i = 1; i<= 3; i++)
    action_loss[i] = (is_valid(i,valid_actions))?0:100;

  // cost of SHIFT
  for(uint32_t i = 0; i<stack.size()-1; i++)
    if (idx <=n && (gold_heads[stack[i]] == idx || gold_heads[idx] == stack[i]))
      action_loss[1] += 1;
  if(stack.size()>0 && gold_heads[stack.last()] == idx)
    action_loss[1] += 1;

  // cost of LEFT
  for(uint32_t i = idx+1; i<=n; i++)
    if (gold_heads[i] == stack.last() || gold_heads[stack.last()] == i)
      action_loss[3] += 1;
  if (stack.size()>0 && idx <=n && gold_heads[idx] == stack.last())
    action_loss[3] += 1;
  if (stack.size()>=2 && gold_heads[stack.last()] == stack[stack.size()-2])
    action_loss[3] += 1;

  // cost of RIGHT
  if (gold_heads[stack.last()] >=idx)
    action_loss[2] += 1;
  for(uint32_t i = idx; i<=n; i++)
    if (gold_heads[i] == stack.last())
      action_loss[2] += 1;

  // return the best action
  size_t best_action = 1;
  for(size_t i=1; i<=3; i++)
    if (action_loss[i] <= action_loss[best_action])
      best_action = i;
  return best_action;
}


void setup(Search::search& srn, vector<example*>& ec) {
  task_data *data = srn.get_task_data<task_data>();
  v_array<uint32_t> &gold_heads=data->gold_heads, &heads=data->heads, &gold_tags=data->gold_tags, &tags=data->tags;
  uint32_t n = (uint32_t) ec.size();
  heads.resize(n+1, true);
  tags.resize(n+1, true);
  gold_heads.erase();
  gold_heads.push_back(0);
  gold_tags.erase();
  gold_tags.push_back(0);
  // read the gold head and gold label of every token from its cost-sensitive label
  for (size_t i=0; i<n; i++) {
    v_array<COST_SENSITIVE::wclass>& costs = ec[i]->l.cs.costs;
    uint32_t head = (costs.size() == 0) ? 0 : costs[0].class_index;
    uint32_t tag = (costs.size() <= 1) ? data->root_label : costs[1].class_index;
    if (tag > data->num_label) {
      cerr << "invalid label " << tag << " which is > num actions=" << data->num_label << endl;
      throw exception();
    }
    gold_heads.push_back(head);
    gold_tags.push_back(tag);
    heads[i+1] = 0;
    tags[i+1] = -1;
  }
  for(size_t i=0; i<6; i++)
    data->children[i].resize(n+1, true);
}


void run(Search::search& srn, vector<example*>& ec) {
  task_data *data = srn.get_task_data<task_data>();
  v_array<uint32_t> &stack=data->stack, &gold_heads=data->gold_heads, &valid_actions=data->valid_actions, &heads=data->heads, &gold_tags=data->gold_tags, &tags=data->tags, &valid_labels=data->valid_labels;
  uint32_t n = (uint32_t) ec.size();
  stack.erase();
  stack.push_back((data->root_label==0)?0:1);
  for(size_t i=0; i<6; i++)
    for(size_t j=0; j<n+1; j++)
      data->children[i][j] = 0;

  int count=1;
  uint32_t idx = ((data->root_label==0)?1:2);
  while(stack.size()>1 || idx<=n){
    if (srn.predictNeedsExample())
      extract_features(srn, idx, ec);
    get_valid_actions(valid_actions, idx, n, (uint32_t) stack.size(), stack.size()>0?stack.last():0);
    uint32_t gold_action = get_gold_actions(srn, idx, n);

    // Predict the next action {SHIFT, REDUCE_LEFT, REDUCE_RIGHT}
    count = 2*idx + 1;
    uint32_t a_id = Search::predictor(srn, (ptag) count).set_input(*(data->ex)).set_oracle(gold_action).set_allowed(valid_actions).set_condition_range(count-1, srn.get_history_length(), 'p').set_learner_id(0).predict();
    count++;

    // Predict the arc label when an arc is created
    uint32_t t_id = 0;
    if (a_id == 2 || a_id == 3){
      uint32_t gold_label = gold_tags[stack.last()];
      t_id = Search::predictor(srn, (ptag) count).set_input(*(data->ex)).set_oracle(gold_label).set_allowed(valid_labels).set_condition_range(count-1, srn.get_history_length(), 'p').set_learner_id(a_id-1).predict();
    }
    count++;
    idx = transition_hybrid(srn, a_id, idx, t_id);
  }


  // the last item left on the stack is attached to the root
  heads[stack.last()] = 0;
  tags[stack.last()] = data->root_label;
  srn.loss((gold_heads[stack.last()] != heads[stack.last()]));
  if (srn.output().good())
    for(size_t i=1; i<=n; i++)
      srn.output() << (heads[i]) << ":" << tags[i] << endl;
}
}
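
To make the transition system in the listing easier to follow in isolation, the following is a minimal stand-alone sketch of the three arc-hybrid transitions, using the same action numbering (1 = SHIFT, 2 = RIGHT, 3 = LEFT) and the same validity conditions as get_valid_actions. It is an illustration under simplifying assumptions, not part of our implementation: the names ParserState, Action, make_state, apply, is_terminal and parse_with_actions are hypothetical, arc labels, losses and child bookkeeping are omitted, and the artificial root (index 0) is assumed to start on the stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-alone parser state (not the task_data struct of the listing).
struct ParserState {
  std::vector<uint32_t> stack;  // token indices; 0 is the artificial root
  std::vector<uint32_t> heads;  // heads[i] = predicted head of token i
  uint32_t idx;                 // front of the buffer (next unread token)
  uint32_t n;                   // sentence length
};

enum Action : uint32_t { SHIFT = 1, RIGHT = 2, LEFT = 3 };

// Assumption: parsing starts with the root on the stack and token 1 at the buffer front.
ParserState make_state(uint32_t n) {
  ParserState s;
  s.stack.push_back(0);
  s.heads.assign(n + 1, 0);
  s.idx = 1;
  s.n = n;
  return s;
}

// Same validity conditions as get_valid_actions in the listing.
bool is_valid(const ParserState& s, Action a) {
  switch (a) {
    case SHIFT: return s.idx <= s.n;
    case RIGHT: return s.stack.size() >= 2;
    case LEFT:  return !s.stack.empty() && s.stack.back() != 0 && s.idx <= s.n;
  }
  return false;
}

void apply(ParserState& s, Action a) {
  assert(is_valid(s, a));
  switch (a) {
    case SHIFT:  // move the buffer front onto the stack
      s.stack.push_back(s.idx++);
      break;
    case RIGHT: {  // attach the stack top to the item below it
      uint32_t child = s.stack.back();
      s.stack.pop_back();
      s.heads[child] = s.stack.back();
      break;
    }
    case LEFT:  // attach the stack top to the buffer front
      s.heads[s.stack.back()] = s.idx;
      s.stack.pop_back();
      break;
  }
}

// Parsing stops when the buffer is empty and at most one item is left on the stack.
bool is_terminal(const ParserState& s) {
  return s.stack.size() <= 1 && s.idx > s.n;
}

// Replay a fixed action sequence and return the resulting head array.
std::vector<uint32_t> parse_with_actions(uint32_t n, const std::vector<Action>& actions) {
  ParserState s = make_state(n);
  for (Action a : actions) apply(s, a);
  return s.heads;
}
```

For a three-token sentence such as "She eats fish", replaying SHIFT, LEFT, SHIFT, SHIFT, RIGHT, RIGHT with parse_with_actions attaches tokens 1 and 3 to token 2 and token 2 to the root. In the listing the action sequence is not fixed in advance but chosen step by step by the learned predictor.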






